DOCUMENT RESUME

ED 294 901                                        TM 011 554

AUTHOR       Reckase, Mark D.
TITLE        Computerized Adaptive Testing: A Good Idea Waiting
             for the Right Technology.
PUB DATE     Apr 88
NOTE         19p.; Paper presented at the Annual Meeting of the
             American Educational Research Association (New
             Orleans, LA, April 5-9, 1988).
PUB TYPE     Speeches/Conference Papers (150) -- Reports -
             Evaluative/Feasibility (142)

EDRS PRICE   MF01/PC01 Plus Postage.
DESCRIPTORS  *Adaptive Testing; *Computer Assisted Testing;
             *Latent Trait Theory; Test Items

ABSTRACT 

The requirements for adaptive testing are reviewed, 
and the question of why implementation has taken so long is examined. 
The concept of a testing procedure that selects items to match the 
level of performance of an examinee during the administration of a 
test had to wait for the technology necessary to apply the idea. 
Current procedures were developed based on item response theory 
methodology. The reliability of shorter tests and scoring has been 
improved by this approach. Refinement of adaptive testing procedures 
is one aspect currently under development; a second is a focus on 
better ways to model person-by-item interaction and to produce test 
items to measure a person's skills. (SLD) 

Adaptive testing has finally reached the point of operational
implementation. Several large scale testing programs now use adaptive testing
(Hsu & Sharmis, 1987; Knapp & Wise, 1987; McBride, Corpe & Wing, 1987; Ward,
Kline & Flaugher, 1986), a commercial software system is now available for use
in developing adaptive tests (Assessment Systems Corporation, 1984), and
several possible implementations for adaptive testing are under investigation
(Moreno, 1987; Olsen, Mayner, Slawson & Ho, 1986; Stevenson & Salehi, 1986).
While it is gratifying to those of us who have worked in the area of adaptive
testing for some time to see it finally reach the point of application, the
question comes to mind: Why did it take so long? After all, standardized
adaptive tests have been available since 1908 (Binet, 1908). This paper
reviews the requirements for adaptive testing and suggests an answer to the
question of why implementation has taken so long.

For many years, I have been observing the conduct of research in many
different fields. These observations have led me to the conclusion that good
ideas do not have much of an effect until the time is right for those ideas.
The current theory of plate tectonics is a good example. For many years the
idea that continents could move was thought to be silly. But eventually,
after enough empirical evidence was accumulated, the theory of "sea floor
spreading" became generally accepted. Similarly, the idea of adapting the
difficulty of a test to each person tested in large scale testing programs had
to wait until the time was right and the necessary technology was available
before the reasonableness of the idea could be accepted and applied. It took
approximately 80 years for this acceptance to occur, but now in 1988, the
concept of adaptive testing is finally beginning to be adopted as a practical
methodology within the set of procedures available to measurement
specialists.

Saying that the time was right for the concept of adaptive testing to be
applied is not very informative, however. What changed in the way that test
constructors thought of tests that allowed the transition to a new type of
testing methodology to take place? The factors that led to the transition are
critical to answering the question "Why now?" These will be enumerated in
some detail after the components of an adaptive testing system are described.

Definition of Adaptive Testing

For the purposes of this paper, adaptive testing is defined as a testing 
procedure that selects items to match the level of performance of an examinee 
during the administration of the test. An operational adaptive test requires 
four components: a set of items from which the test is selected, a procedure 
for selecting the items, a method for computing a test score once the test is 
completed, and a means for determining when testing is done. The components 
of an adaptive test are described in more detail in Green, Bock, Humphreys, 
Linn & Reckase (1984). 

Stanford-Binet as an Adaptive Test 

In order to clarify the definition of adaptive testing, the 1960 edition
of the Stanford-Binet (Terman & Merrill, 1960), a direct descendant of the Binet
1908 test, will be analyzed in some detail. Although the 1960 version will be
used for convenience, the basic analysis applies to the earlier versions as
well. A critical feature of the Stanford-Binet is that both the scores of
examinees and the difficulties of test items are reported on the same score
scale, the mental age (MA) scale. The 1960 edition consisted of 122 tasks
(items) arranged into 20 sets according to their level of difficulty. Items
within a set were selected for the set because they were answered correctly
about half the time by a particular age group. Thus, tasks in the six year
set were answered correctly approximately half the time by six year olds.
These 122 tasks composed the item pool of the adaptive test. During the test,
items were selected to match the examinee's ability by starting at a level in
the 20 sets judged to be appropriate for the examinee and then administering
easier tasks until a level was reached where all items were answered correctly
(basal level) and then administering more difficult tasks until all items were
answered incorrectly (ceiling level). It was assumed that all tasks below the
basal level would have a 1.00 probability of a correct response and those
above the ceiling level would have a 0.0 probability of a correct response.

The test was scored by adding a specified number of months for each 
correct item to the year designation for the basal level. This scoring 
procedure was in effect estimating the level on the mental age scale at which 
half of the items would be responded to correctly. The stopping rule for the
procedure was to stop administering items when the basal and ceiling levels
were determined.
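
The basal and ceiling procedure and its scoring rule can be sketched in a few lines of code. This is only an illustration under stated assumptions: the function name `administer_basal_ceiling`, the seven year levels, six items per level, the credit of two months per item, and the response pattern of the hypothetical examinee are all invented for the example and are not taken from the Stanford-Binet manual.

```python
from typing import Callable, Dict, List


def administer_basal_ceiling(
    years: List[int],
    items_per_level: int,
    start_index: int,
    answer: Callable[[int, int], bool],
    months_per_item: int,
) -> Dict[str, int]:
    """Administer levels until a basal and a ceiling level are found, then score."""
    correct_at: Dict[int, int] = {}  # year level -> number of items passed

    def give_level(index: int) -> int:
        # Administer every item at a level once and remember the count passed.
        year = years[index]
        if year not in correct_at:
            correct_at[year] = sum(answer(year, k) for k in range(items_per_level))
        return correct_at[year]

    # Work downward until every item at a level is passed (basal level).
    # If the easiest level is reached first, it is used as the basal (a simplification).
    basal = start_index
    while give_level(basal) < items_per_level and basal > 0:
        basal -= 1

    # Work upward until every item at a level is failed (ceiling level).
    ceiling = start_index
    while give_level(ceiling) > 0 and ceiling < len(years) - 1:
        ceiling += 1

    # Score: basal year plus a fixed credit for each item passed above it.
    passed_above_basal = sum(correct_at[years[i]] for i in range(basal + 1, ceiling + 1))
    return {
        "basal_year": years[basal],
        "ceiling_year": years[ceiling],
        "items_administered": items_per_level * len(correct_at),
        "mental_age_months": 12 * years[basal] + months_per_item * passed_above_basal,
    }


def example_examinee(year: int, item: int) -> bool:
    # Hypothetical response pattern: passes everything through year 6,
    # four items at year 7, one item at year 8, nothing beyond.
    passed = {7: 4, 8: 1}
    return year <= 6 or item < passed.get(year, 0)


result = administer_basal_ceiling(
    years=[4, 5, 6, 7, 8, 9, 10],
    items_per_level=6,
    start_index=3,       # begin at the year-7 level
    answer=example_examinee,
    months_per_item=2,   # assumed credit: 6 items x 2 months = 1 year
)
```

In this run the examinee is given all items at four levels (years 6 through 9), although only the items at years 7 and 8 produce a mix of correct and incorrect responses. The all-correct basal level and the all-incorrect ceiling level serve only to bracket the score.
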

More Recent Adaptive Tests 

Although the Stanford-Binet type administration method had been in place
for many years, it was not until the late 1950's to the early 1970's that
attempts were made to adapt the test to the examinee on a larger scale than in
one-on-one individual examinations. During that period of time, two-stage
testing (Angoff & Huddleston, 1958), pyramidal testing (Krathwohl & Huyser,
1956), and the flexilevel test (Lord, 1971) were investigated. These
procedures placed test items in a particular structural arrangement based on
p-values and developed fixed paths through the items to match the test to the
examinee. Weiss (1974) gives a good summary of these procedures.

The currently popular procedures for adaptive testing were developed
during the late 1960's and early 1970's based on item response theory (IRT)
methodology. These procedures used item pools that were precalibrated using
item response theory models rather than group statistics, such as p-values.
Items were selected for administration based on mathematical functions of the
item-parameter estimates, such as item information or the minimum posterior
variance of a Bayesian procedure. Scoring was also model based using maximum
likelihood or Bayesian estimation procedures. Finally, the test was stopped
when a decision was made, a level of precision was reached, or when a fixed
number of items had been administered. Hulin, Drasgow & Parsons (1983)
provide a summary of the procedures for IRT-based adaptive testing. For the
most part, the previously developed structure-based procedures have lost favor
to these new methods. The two-stage, pyramidal, and flexilevel testing
procedures have never been used in operational testing programs. The next
section of this paper will discuss each component of an adaptive testing
system and indicate why the current procedures have supplanted the earlier
methods.



Requirements for a Practical Adaptive Test

The Stanford-Binet was never used for large scale testing because of the
requirement for one-on-one administration, the complexity of the
administration of the items, and the time required for administration and
scoring. Are there similar problems that led to the rejection of procedures
like the pyramidal and flexilevel tests, or were the IRT-based procedures
simply better? Each component of an adaptive test will be considered in turn
to identify the reasons for the success of the IRT-based procedures.

Item Pool

At the most basic level, the items in an item pool are not affected by
the procedure used to select and administer them. The same items could be
used with any of the available adaptive testing procedures. What has changed,
however, is the type of information collected about an item. Current adaptive
testing procedures are based on item statistics determined using item response
theory procedures. These procedures make it relatively easy to put item
statistics obtained from the responses of several different groups of people
to different sets of items on the same scale so that they can all be
considered in the item selection process. The result is that item pools of
any required size can be produced by linking together separate calibrations.
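
One simple way to link two calibrations can be sketched as follows. The paper does not say which linking method is meant, so this example assumes the mean/sigma method applied to a set of anchor items that appear in both calibrations, with item parameters from a logistic model with discrimination `a` and difficulty `b`. The function names and the numbers are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def mean_sigma_constants(b_base: List[float], b_new: List[float]) -> Tuple[float, float]:
    """Slope and intercept that map the new calibration onto the base scale.

    b_base and b_new hold difficulty estimates for the SAME anchor items,
    obtained from the base calibration and the new calibration respectively.
    """
    slope = pstdev(b_base) / pstdev(b_new)
    intercept = mean(b_base) - slope * mean(b_new)
    return slope, intercept


def to_base_scale(a: float, b: float, slope: float, intercept: float) -> Tuple[float, float]:
    """Re-express one item's discrimination and difficulty on the base scale."""
    return a / slope, slope * b + intercept


# Hypothetical anchor items: the new scale is a stretched, shifted base scale.
anchor_b_base = [-0.75, 0.25, 1.25]
anchor_b_new = [-2.0, 0.0, 2.0]
slope, intercept = mean_sigma_constants(anchor_b_base, anchor_b_new)

# A non-anchor item from the new calibration, now usable in the combined pool.
a_linked, b_linked = to_base_scale(0.8, 1.0, slope, intercept)
```

Because the transformation is linear in the ability scale, every item from the second calibration can be carried into the pool with the same two constants. No comparably simple transformation exists for p-values.
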

Previous methodology tended to use group-based item statistics, such as
p-values, to build the structured sets of items. Since p-values for different
tests or different samples are likely to be nonlinearly related and to have
inconvenient scale properties, it was not easy to combine sets of items into a
single pool.

The IRT formulation of information also led developers to think about how
much information should be provided at different points along the score
scale. Consideration of item pool size and characteristics followed (see
Patience & Reckase, 1979). Since previous procedures, with the notable
exception of the Stanford-Binet, used group statistics to form the item pool,
the effects of item pool characteristics were not readily considered. The set
of items was considered as a unit, not as single pieces that could be used to
construct a pool with particular characteristics.

Item Selection Procedure

Most current adaptive testing procedures select test items to maximize
the information provided at the most recent estimate of ability. Without the
current computer technology and the development of item response theory
methodology this item selection methodology could not be implemented. It can
be argued that the previous adaptive testing algorithms were attempting to do
the same thing, but with less success. Although the Stanford-Binet
administration procedure administers the most informative items, it also
administers items that are not very informative by testing until both ceiling
and basal levels are reached. The items at the ceiling and basal levels are
too difficult or too easy and are therefore not very informative. Flexilevel
testing tended to give uninformative items as the process continued because
the items at the extremes of the difficulty range were used. Because of the
use of uninformative items, more items were used than were necessary to obtain
an ability estimate at a specified level of precision. Ireland (1976) showed
that an IRT-based test using the Stanford-Binet item pool could shorten the
test appreciably without losing reliability. De Ayala & Koch (1986) found the
flexilevel test to require about twice as many items as an IRT-based adaptive
test of equivalent reliability. The efficiency gained by the current
procedures results from considering whether each item will add to the
information provided by the testing process.
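
Maximum-information item selection can be illustrated with a short sketch. It assumes a two-parameter logistic model, for which the information of an item at ability theta is a^2 P(theta)(1 - P(theta)). The paper does not commit to a particular IRT model, and the function names and the four-item pool below are hypothetical.

```python
import math
from typing import List, Set, Tuple


def prob_correct(theta: float, a: float, b: float) -> float:
    """Two-parameter logistic probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta: float, a: float, b: float) -> float:
    """Item information at theta for the two-parameter logistic model."""
    p = prob_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def select_item(theta_hat: float, pool: List[Tuple[float, float]], used: Set[int]) -> int:
    """Index of the unused item that is most informative at the current estimate."""
    candidates = [i for i in range(len(pool)) if i not in used]
    return max(candidates, key=lambda i: item_information(theta_hat, *pool[i]))


# Hypothetical pool of (a, b) pairs.
pool = [(1.0, -2.0), (1.0, 0.0), (1.0, 2.0), (0.5, 0.1)]
first = select_item(0.1, pool, set())
second = select_item(0.1, pool, {first})
```

Note that the last item has a difficulty that matches the ability estimate exactly, yet it is not chosen first because its low discrimination limits the information it can supply. Matching on difficulty alone, as the structure-based procedures did, ignores this.
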

Scoring Procedure 

The scoring of non-IRT-based adaptive tests has always been a problem.
Classical test theory does not readily deal with cases where different
examinees get different items and even different numbers of items. IRT-based
procedures readily produce ability estimates on the same scale after each item
has been administered. This feature is a result of including both item- and
person-parameters in the same model. The Stanford-Binet comes closest to the
current adaptive testing scoring procedures because both items and people were
scaled on the same mental age score scale. The other procedures that were
developed prior to the use of IRT never solved the scoring problem in a way
that yielded good statistical characteristics.
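
A small sketch may help show why model-based scoring copes with different item sets. It assumes a two-parameter logistic model and finds the maximum likelihood ability estimate by a plain grid search, which stands in here for the iterative procedures used in practice. The names `log_likelihood` and `ml_ability` and the item parameters are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# Candidate ability values from -4.00 to 4.00 in steps of 0.01.
THETA_GRID = [i / 100.0 for i in range(-400, 401)]


def log_likelihood(theta: float, items: Sequence[Tuple[float, float]],
                   responses: Sequence[int]) -> float:
    """Log-likelihood of a response pattern (1 = correct, 0 = incorrect)."""
    total = 0.0
    for (a, b), u in zip(items, responses):
        p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        total += math.log(p) if u else math.log(1.0 - p)
    return total


def ml_ability(items: Sequence[Tuple[float, float]], responses: Sequence[int]) -> float:
    """Grid value that maximizes the likelihood.

    For all-correct or all-incorrect patterns the maximum lies at the edge
    of the grid, reflecting that no finite ML estimate exists in those cases.
    """
    return max(THETA_GRID, key=lambda t: log_likelihood(t, items, responses))


# Two examinees with the same raw score (one of two correct) on different items.
easier_items: List[Tuple[float, float]] = [(1.0, -1.0), (1.0, 1.0)]
harder_items: List[Tuple[float, float]] = [(1.0, 0.0), (1.0, 2.0)]
theta_easier = ml_ability(easier_items, [1, 0])
theta_harder = ml_ability(harder_items, [1, 0])
```

Both examinees answer one of two items correctly, but the estimate for the examinee who took the harder items is one unit higher. The item difficulties enter the score directly, which is what a raw count of correct responses cannot do.
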

Stopping Rules 

Since traditional tests or early adaptive tests had no good means for 
reporting scores on the same scale when different people took different 
numbers of different items, stopping rules seldom were required. Fixed length 
tests were the norm. The Stanford-Binet had a variable test length, but at 
the expense of administering items to each examinee that were either too easy
or too hard. The result was a lengthy and frustrating testing session.

Adaptive procedures based on item response theory have substantial 
flexibility in specifying stopping rules. Procedures have been proposed for 
stopping at a specified level of information, at a specified posterior 
variance for Bayesian procedures (Owen, 1975), or when a decision is made with
a specified level of certainty (Reckase, 1980). Of course, fixed length tests
can also be used. The IRT-based procedures allow the number of items
administered to be closely tied to the requirements of test use.
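
As a sketch of how such stopping rules combine, the loop below stops when the posterior variance of ability falls to a target value or when a fixed number of items has been given, whichever comes first. It is not Owen's (1975) procedure: it assumes a two-parameter logistic model, a standard normal prior, and a posterior computed numerically on a grid. All names, the pool, and the deterministic examinee are hypothetical.

```python
import math
from typing import Callable, List, Tuple

GRID = [i / 10.0 for i in range(-40, 41)]  # ability values from -4.0 to 4.0


def p_correct(theta: float, a: float, b: float) -> float:
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def posterior_mean_variance(weights: List[float]) -> Tuple[float, float]:
    m = sum(t * w for t, w in zip(GRID, weights))
    v = sum((t - m) ** 2 * w for t, w in zip(GRID, weights))
    return m, v


def run_test(pool: List[Tuple[float, float]], answer: Callable[[int], bool],
             target_variance: float, max_items: int) -> Tuple[float, float, List[int]]:
    # Standard normal prior on the grid (an assumption of this sketch).
    weights = [math.exp(-0.5 * t * t) for t in GRID]
    total = sum(weights)
    weights = [w / total for w in weights]
    used: List[int] = []
    estimate, variance = posterior_mean_variance(weights)

    # Stopping rule: precision reached OR fixed length reached OR pool exhausted.
    while variance > target_variance and len(used) < min(max_items, len(pool)):
        def info(i: int) -> float:
            a, b = pool[i]
            p = p_correct(estimate, a, b)
            return a * a * p * (1.0 - p)

        # Give the unused item that is most informative at the current estimate.
        item = max((i for i in range(len(pool)) if i not in used), key=info)
        correct = answer(item)
        a, b = pool[item]
        weights = [w * (p_correct(t, a, b) if correct else 1.0 - p_correct(t, a, b))
                   for t, w in zip(GRID, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
        used.append(item)
        estimate, variance = posterior_mean_variance(weights)
    return estimate, variance, used


# Hypothetical pool and a deterministic examinee who passes items easier than 0.7.
pool = [(1.5, -3.0 + 0.2 * k) for k in range(31)]
answer = lambda i: pool[i][1] < 0.7
estimate, variance, used = run_test(pool, answer, target_variance=0.2, max_items=20)
```

Tightening `target_variance` lengthens the test and loosening it shortens the test, so test length follows directly from the precision that the intended use of the score requires.
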

Factors that Distinguish Current 
from Earlier Adaptive Tests 

The differences between current and previous adaptive testing procedures
given above suggest that certain factors facilitated the development of the
current adaptive testing procedures. These factors are summarized below.

The Item as the Measurement Unit 

The critical feature of adaptive testing is that the assessment of the
skill level of an individual is the result of numerous interactions of the
person and individual test items. Each item provides some amount of
information about the skill level of the person, but not all items provide
equal amounts of information. This conception of the measurement process is a
major breakthrough because it leads to the idea that items have
characteristics that are independent of the group to which they are
administered. Prior to the work of Lord (1952), items were described using
statistics based on group performance. In fact, texts prior to Lord's seminal
work gave very little advice about how to select items for a test. However,
Gulliksen (1950, pp. 392-393), at least, was aware of the need for relatively
constant item characteristics and suggested that better item descriptors be a
goal for future research.

Classical test theory functioned mainly at the level of the test score 
and its characteristics. Little consideration was given to the effect of 
particular items. The focus on the test score did not encourage researchers 
to match items to each examinee. If a group was well measured, individual 
measurement was also expected to be good.

Item Information 

Once items were considered as individual tools for use in assessing a
person's ability, the next question was, for what range of ability does an item
give useful information about the person? The Stanford-Binet answered the
question by using items that had a probability of correct response for the
examinee that was neither 1.00 nor 0.00. This resulted in items being used
over a fairly broad range of ability. This range was typically three to four
standard deviation units wide. Current theory (see Reckase & McKinley, 1984)
suggests that the effective range of an item is only .7 to 1.8 standard
deviation units wide. Thus, the Stanford-Binet was administering many items
that did not provide much information, resulting in a low level of measurement
efficiency.

Since most adaptive testing procedures currently used select items to
maximize test information, the concept of item information is clearly
important to adaptive testing. However, prior to the availability of IRT, the
concept of information was unknown. There was no classical measurement theory
analog to the information function. Item quality statistics were all based on
group performance. Gulliksen (1950), for example, was surprised to find that
the biserial-correlation item-discrimination index changed with the ability
level of the group used to compute the statistics because he considered an
item to measure equally well over the entire ability range. Only with the use
of IRT concepts can the range of effectiveness of an item be understood.

Scoring

A critical feature of adaptive testing is that the score obtained by an 
examinee is independent of the particular set of items given. Although this 
concept was clearly a part of all adaptive testing procedures, non-IRT-based
procedures had difficulty developing a reasonable scoring scheme. For
example, with flexilevel testing (Lord, 1971), the score on the test is the
number of correct responses, plus .5 if the last response was incorrect. This
scoring scheme would not yield comparable scores for flexilevel tests with
different items. The score on a flexilevel test was basically equivalent to a
raw score on a traditional test.
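
The flexilevel scoring rule just described is simple enough to state as code. The function name `flexilevel_score` is hypothetical, and the sketch covers only the scoring step, not the flexilevel item-routing procedure.

```python
from typing import Sequence


def flexilevel_score(responses: Sequence[bool]) -> float:
    """Number of correct responses, plus 0.5 if the last response was incorrect."""
    if not responses:
        raise ValueError("at least one response is required")
    score = float(sum(responses))
    if not responses[-1]:
        score += 0.5
    return score


score_a = flexilevel_score([True, False, True, True, False])
score_b = flexilevel_score([True, True])
```

Nothing about the items themselves enters the computation, which is why scores from flexilevel tests built from different items cannot be compared directly.
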

In order to truly free the scoring process from the set of items used,
the item characteristics must be included in the same model as the person
ability parameter. The Stanford-Binet achieves this by using the mental age
scale to describe both item difficulty and test score. A five-year item was
one that approximately fifty percent of five-year-olds could answer
correctly. This placement of items on the score scale allowed the scoring of
any set of items. Until this same scheme was developed for generic adaptive
systems, scoring independent of the item set could not be achieved.

Summary 

It should be clear at this point that the existence of useful adaptive 
testing systems is closely tied to the development of item response theory. 
The Stanford-Binet was, in effect, using item response theory in 1916, but in
such a restricted way that it could not easily be generalized. Chronological
age acted as a surrogate for the ability scale, allowing for the scaling of
both items and people on the same scale. Until IRT became available, the
concepts underlying the Stanford-Binet administration procedure could not be
understood, and therefore they were not considered to constitute a
generalizable model. In a sense the Stanford-Binet administration model is
the equivalent to the theories of plate tectonics mentioned earlier that were
not considered viable until the supporting observational data were in place.
In the field of measurement, it took about 80 years for the necessary theory
to develop that supported the Stanford-Binet type administration.

Future Directions 

Now that the power of item response theory has been realized and is
taking hold in operational adaptive testing programs, have all of the basic
measurement problems been solved? Of course not! The work has just begun.

Current research on adaptive testing seems to be taking two basic
directions. The first direction is related to refining the current
methodology. Concerns over whether items function the same on computer or in
paper-and-pencil form (Divgi, 1986) or considerations of how to calibrate
items as part of an adaptive test (Samejima, 1988) fall into this
category. These are critical areas of research, but they are not likely to
result in major advances in testing.

The second type of research effort focuses on better ways of modeling the
person-by-item interaction and of producing test items to measure a person's
skills. Item response theory models are being produced that are nonmonotonic
(Thissen & Steinberg, 1984), polychotomous (Sympson, 1986), and/or
multidimensional (Reckase, 1985). In the future, adaptive tests will be based
on a more accurate representation of the person-by-item interaction.
Procedures are also being developed to generate items by computer to match the
required item characteristics (Bejar, 1986). While these procedures have very
limited capabilities now, if they can be enhanced in the future, adaptive
tests will have essentially unlimited item pools that do not require
calibration. The adaptive test of tomorrow may be equivalent to the
idealized, infinite length tests of today.

Measurement has reached a new golden age. The number of interesting
problems and promising approaches to solving them is almost limitless. But a
key feature, which is an outgrowth of IRT methodology, is consideration of the
test item. This is the equivalent for the field of measurement of the discovery
of the concept of the atom for chemistry and physics.